Efficient Concurrent Programming with Python's ThreadPoolExecutor and as_completed
Published: August 28, 2023 | Author: Vispi Nevile Karkaria
How ThreadPoolExecutor and as_completed Can Help
ThreadPoolExecutor and as_completed in Python's concurrent.futures library offer an efficient way to manage concurrent tasks. They enable:
- Better utilization of time otherwise spent waiting, making your application faster and more responsive (threads in CPython do not spread CPU-bound work across cores, due to the GIL).
- Improved execution time for I/O-bound tasks like file operations or network requests, as they don't have to wait for each other to complete.
- A convenient API for handling asynchronous tasks, removing much of the complexity associated with concurrent programming.
Python Code Example
To demonstrate, here is a simple Python code snippet using ThreadPoolExecutor and as_completed:
```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def worker_function(x):
    return x * x

# Create a ThreadPoolExecutor
with ThreadPoolExecutor() as executor:
    # Submit tasks to the executor
    futures = {executor.submit(worker_function, i): i for i in range(5)}
    # Process results as they complete
    for future in as_completed(futures):
        print(future.result())
```
The code above creates a ThreadPoolExecutor and submits five tasks to it. The tasks are processed concurrently, and their results are printed as they become available.
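Because the dictionary maps each future back to the input it was submitted with, you can also recover which task produced which result, and handle failures per task instead of letting one exception abort the whole batch. Here is a minimal sketch of that pattern (the failing input and the error message are made up for illustration):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def worker_function(x):
    # A worker that fails for one input, to demonstrate per-task error handling
    if x == 3:
        raise ValueError("bad input")
    return x * x

results = {}
with ThreadPoolExecutor() as executor:
    futures = {executor.submit(worker_function, i): i for i in range(5)}
    for future in as_completed(futures):
        i = futures[future]  # map the finished future back to its input
        try:
            results[i] = future.result()
        except ValueError as exc:
            results[i] = f"failed: {exc}"

print(results)
```

Calling `future.result()` re-raises any exception the worker raised, so the `try`/`except` around it is what keeps one bad task from crashing the loop.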
Use Cases
Data Scraping
ThreadPoolExecutor can significantly expedite data scraping tasks by running multiple scrapers in parallel, thus reducing the time it takes to retrieve data from multiple sources.
Python Code Example
```python
from concurrent.futures import ThreadPoolExecutor
import requests

def fetch_data(url):
    return requests.get(url).text

urls = ['https://example.com/page1', 'https://example.com/page2']

# Create a ThreadPoolExecutor
with ThreadPoolExecutor() as executor:
    # Fetch data from multiple URLs in parallel
    results = list(executor.map(fetch_data, urls))
```
This example demonstrates how to perform web scraping on two URLs concurrently. The `executor.map()` method is a convenient way to execute the `fetch_data` function on each URL in the `urls` list. The results are returned in a list, in the same order as the input URLs, even though the requests run concurrently.
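In real scraping, some URLs inevitably fail, and with `executor.map()` the first exception aborts iteration over the results. A sketch of a more robust variant combines `submit` with `as_completed` and records failures per URL. The `fetch_data` below is a stand-in that simulates one failing request so the example runs without network access; in practice it would call `requests.get(url).text`:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Stand-in for requests.get(url).text, so the sketch runs without network access
def fetch_data(url):
    if "page2" in url:
        raise ConnectionError(f"could not reach {url}")
    return f"<html>content of {url}</html>"

urls = ['https://example.com/page1', 'https://example.com/page2']

pages = {}
with ThreadPoolExecutor(max_workers=8) as executor:
    futures = {executor.submit(fetch_data, url): url for url in urls}
    for future in as_completed(futures):
        url = futures[future]
        try:
            pages[url] = future.result()
        except ConnectionError:
            pages[url] = None  # record the failure instead of crashing the batch
```

Every URL ends up with an entry in `pages`, successful or not, which makes it easy to retry just the failed ones afterwards.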
Image Processing
Tasks like resizing, filtering, and transformation can be parallelized using ThreadPoolExecutor to improve the overall processing speed. This is particularly beneficial when you have to process a large number of images.
Python Code Example
```python
from PIL import Image
from concurrent.futures import ThreadPoolExecutor

def resize_image(image_path, output_path):
    img = Image.open(image_path)
    img = img.resize((300, 300))
    img.save(output_path)

image_paths = ['image1.jpg', 'image2.jpg', 'image3.jpg']
output_paths = ['image1_resized.jpg', 'image2_resized.jpg', 'image3_resized.jpg']

with ThreadPoolExecutor() as executor:
    # Consume the iterator so any exception raised in a worker is re-raised here
    list(executor.map(resize_image, image_paths, output_paths))
```
This Python snippet uses the Pillow library to resize images concurrently. ThreadPoolExecutor's `map()` method handles the parallel dispatch. Note that `map()` returns a lazy iterator: an exception raised in a worker only surfaces once you iterate over the results, so it is worth consuming them even when you don't need the return values.
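When processing a large batch of images, it is often useful to report progress as each one finishes rather than waiting for the whole batch. A sketch of that pattern with `submit` and `as_completed` follows; the `resize_image` here is a do-nothing stand-in for the Pillow version above, so the example runs without Pillow or image files on disk:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Stand-in for the Pillow-based resize_image above, so the sketch runs anywhere
def resize_image(image_path, output_path):
    return output_path

image_paths = ['image1.jpg', 'image2.jpg', 'image3.jpg']
output_paths = ['image1_resized.jpg', 'image2_resized.jpg', 'image3_resized.jpg']

done = []
with ThreadPoolExecutor() as executor:
    futures = {
        executor.submit(resize_image, src, dst): src
        for src, dst in zip(image_paths, output_paths)
    }
    for future in as_completed(futures):
        done.append(futures[future])
        print(f"{futures[future]} finished ({len(done)}/{len(futures)})")
```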
API Calls
Making API calls sequentially can be a bottleneck in your application. By using ThreadPoolExecutor, you can make multiple API requests in parallel, thus saving a considerable amount of time.
Python Code Example
```python
import requests
from concurrent.futures import ThreadPoolExecutor

def fetch_data_from_api(api_url):
    return requests.get(api_url).json()

api_urls = ['https://api.example.com/data1', 'https://api.example.com/data2']

with ThreadPoolExecutor() as executor:
    results = list(executor.map(fetch_data_from_api, api_urls))
```
The above code demonstrates how to make API requests concurrently using ThreadPoolExecutor. The `executor.map()` function concurrently fetches data from the list of API URLs and stores the results in a list.
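Real APIs fail transiently, so a per-call retry wrapper is a common addition. Below is a minimal sketch: `fetch_data_from_api` is a stand-in that simulates an API failing on the first attempt (so the example runs offline), and `fetch_with_retry` is a made-up helper name for the retry logic you would wrap around a real `requests.get(...).json()` call:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for requests.get(api_url).json(); fails on the first call per URL
_attempts = {}

def fetch_data_from_api(api_url):
    _attempts[api_url] = _attempts.get(api_url, 0) + 1
    if _attempts[api_url] == 1:
        raise ConnectionError("transient failure")
    return {"url": api_url, "ok": True}

def fetch_with_retry(api_url, retries=3):
    for attempt in range(retries):
        try:
            return fetch_data_from_api(api_url)
        except ConnectionError:
            if attempt == retries - 1:
                raise  # give up after the last attempt

api_urls = ['https://api.example.com/data1', 'https://api.example.com/data2']

with ThreadPoolExecutor() as executor:
    results = list(executor.map(fetch_with_retry, api_urls))
```

Passing the wrapper to `executor.map()` keeps the retry policy in one place while the pool still issues the requests in parallel.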
Tips and Caveats
- Use the `with` statement (or call `shutdown()` explicitly) so the executor's worker threads are cleaned up when you are done.
- Be cautious of thread safety, especially when dealing with shared resources like global variables or file systems.
- Beware of the Global Interpreter Lock (GIL) in CPython, which can hinder true parallelism when performing CPU-bound tasks.
- Adjust the number of threads based on your system's capabilities and the nature of your tasks. Excessive threads might lead to context switching overhead.
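The thread-safety caveat above can be sketched with a `threading.Lock` guarding a shared counter (the counter and worker function here are made up purely for illustration):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

counter = 0
lock = threading.Lock()

def increment(_):
    global counter
    # Guard the read-modify-write so concurrent workers can't interleave it
    with lock:
        counter += 1

with ThreadPoolExecutor(max_workers=4) as executor:
    list(executor.map(increment, range(1000)))

print(counter)  # 1000
```

Without the lock, `counter += 1` is a read-modify-write that two threads can interleave, so increments can be lost intermittently.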
Conclusion
ThreadPoolExecutor and as_completed offer an efficient and convenient way to handle concurrent programming tasks in Python. By understanding their capabilities and limitations, you can significantly improve the performance and efficiency of various applications, be it data scraping, image processing, or API interactions.
I encourage you to experiment with these tools and find the right balance between concurrency and parallelism to meet your specific needs.